final verdict
ALARB: An Arabic Legal Argument Reasoning Benchmark
Shairah, Harethah Abu, AlHarbi, Somayah, AlHussein, Abdulaziz, Alsabea, Sameer, Shaqaqi, Omar, AlShamlan, Hebah, Knio, Omar, Turkiyyah, George
We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset's utility for instruction tuning. Notably, we show that instruction-tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.
- Asia > Middle East > Saudi Arabia (0.49)
- North America > United States > Florida > Miami-Dade County > Miami (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
ProdRev: A DNN framework for empowering customers using generative pre-trained transformers
Following the pandemic, customers' preference for e-commerce has accelerated. Because so much information is spread across multiple reviews (sometimes running into the thousands) for a single product, it can create decision paralysis for the buyer. This disempowers the consumer, who cannot be expected to go through so many reviews, since doing so is time-consuming and can be confusing. Various commercial tools are available that use a scoring mechanism to arrive at an adjusted score, which can alert the user to potential review manipulation. This paper proposes a framework that fine-tunes a generative pre-trained transformer to understand these reviews better and to use "common sense" to make better decisions. These models have more than 13 billion parameters. To fine-tune the model for our requirement, we use the curie engine from the generative pre-trained transformer (GPT3). By using generative models, we introduce abstractive summarization instead of a simple extractive method of summarizing the reviews. This brings out the true relationship between the reviews rather than simply copying and pasting from them. It introduces an element of "common sense" for the user and helps them quickly make the right decisions. The user is given the pros and cons from the processed reviews and can thus make their own decisions.
GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
Song, Mingyang, Zheng, Mao, Luo, Xuan
Using Large Language Models (LLMs) to evaluate and compare two answers from different models typically involves having LLM-based judges select the better answer. However, humans often approach problem-solving from a reverse perspective, for instance, by choosing the worse option instead of the better one in a pairwise comparison. This kind of reverse thinking plays a crucial role in human reasoning and decision-making, and it also makes it possible to test the difference between the original and the reversed thought processes. Motivated by this, we propose a Goal-Reversed Prompting (GRP) approach for pairwise evaluation that shifts the original task from selecting the better answer to choosing the worse one. We encourage LLMs to think in reverse by prompting them to identify the worse response. Experiments on closed-source models demonstrate that GRP significantly enhances evaluation capabilities, outperforming the prompt template with the original goal.
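The goal reversal described in this abstract can be illustrated with a minimal Python sketch. The prompt wording, the function names `build_prompt` and `preferred_answer`, and the `judge` callable (any function mapping a prompt string to the model's reply) are all hypothetical and are not taken from the paper; the sketch only shows the idea of asking for the worse answer and inverting the reply.

```python
from typing import Callable

# Hypothetical goal statements; the paper's actual templates are not shown here.
ORIGINAL_GOAL = "Which response is better? Reply with 'A' or 'B' only."
REVERSED_GOAL = "Which response is worse? Reply with 'A' or 'B' only."


def build_prompt(question: str, answer_a: str, answer_b: str,
                 reverse_goal: bool = True) -> str:
    """Build a pairwise-evaluation prompt with the original or reversed goal."""
    goal = REVERSED_GOAL if reverse_goal else ORIGINAL_GOAL
    return (
        f"Question:\n{question}\n\n"
        f"Response A:\n{answer_a}\n\n"
        f"Response B:\n{answer_b}\n\n"
        f"{goal}"
    )


def preferred_answer(judge: Callable[[str], str], question: str,
                     answer_a: str, answer_b: str) -> str:
    """Ask the judge for the worse response and return the label of the other one."""
    reply = judge(build_prompt(question, answer_a, answer_b, reverse_goal=True))
    worse = reply.strip().upper()
    if worse not in {"A", "B"}:
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    # The judge named the worse response, so the preferred one is the other label.
    return "B" if worse == "A" else "A"
```

In use, `judge` would wrap a call to whichever LLM serves as the evaluator; passing `reverse_goal=False` to `build_prompt` gives the baseline prompt with the original goal for comparison.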
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Saha, Swarnadeep, Li, Xian, Ghazvininejad, Marjan, Weston, Jason, Wang, Tianlu
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on a smaller number of synthetically generated preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
'Roswell: The Final Verdict' Review: Aliens vs. Artificial Intelligence
The recent emergence of U.S. Navy videos of UFOs--and the fact that the government is addressing them seriously--will no doubt generate larger than average buzz around "Roswell: The Final Verdict," although the title suggests something like "Final Destination 6": Will the question of intergalactic life ever really be resolved until extraterrestrials can walk comfortably among us? "Final Verdict" is hooked to the 74th anniversary of the incidents at Roswell. It's safe to expect similar celebrations next year. Meanwhile, this Discovery production is an ambitious if somewhat overheated summing-up of what happened near the New Mexico city in 1947, the stuff of both scientific speculation and folklore: Did the government cover up the crash landing of an alien spaceship, replete with otherworldly visitors? Or did the "witnesses" who claimed that it all happened construct an elaborate hoax?
- North America > United States > New Mexico (0.27)
- North America > Mexico > Mexico City > Mexico City (0.27)
- Government > Military (0.75)
- Government > Regional Government > North America Government > United States Government (0.39)